STAT5003: Project Plan and EDA

Workshop 11 Group 01

Authors
Affiliations

jche0758

SID: 520110054

lals0119

SID: 540615841

nals0930

SID: 540927401

nkha0389

SID: 540829493

yaff0377

SID: 530784645

Published

April 13, 2025

Code
Code
#load libraries
library(tidyverse)  #for data science, loads other core libraries
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(kableExtra) #for table styling

Attaching package: 'kableExtra'

The following object is masked from 'package:dplyr':

    group_rows
Code
library(scales)     #for scaling axes

Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor
Code
library(reshape2)   #for reshaping data

Attaching package: 'reshape2'

The following object is masked from 'package:tidyr':

    smiths
Code
library(fmsb)       #for radar/spider charts
library(patchwork)  #for combining ggplots
library(plotly)     #for interactive ggplots

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
Code
#load the dataset
diabetic_data <- read_csv("diabetic_data.csv", show_col_types = FALSE)

1. Problem Definition

Hospital readmissions are a critical healthcare problem due to the significant impact they have on patient’s health, healthcare costs, and the overall efficiency of the healthcare system. Hospital readmissions, primarily those occurring within 30 days of discharge, are a key indicator of the healthcare quality provided to patients with chronic conditions like diabetes. Predicting diabetic patient readmissions is significant for improving the overall healthcare quality and implementing proper post-discharge support and intervention plans, ultimately improving patient’s long-term health and reducing unnecessary healthcare costs (Sharma et al. 2019).

The classification task this project aims to tackle is to predict diabetic patient readmission status within 30 days of discharge, using data collected from 130 United States (US) hospitals between 1999 and 2008. The dataset consists of various attributes on patient demographics, medical and hospitalization records, and treatment procedures, providing details on the factors contributing to patient readmission status (Strack et al. 2014). This is a multi-class classification task, where the target variable readmitted \(y\) has three distinct classes.

\[ y = \begin{cases} <30 & \text{(patient was readmitted within 30 days)} \\ >30 & \text{(patient was readmitted after 30 days)} \\ \text{No} & \text{(patient was not readmitted)} \end{cases} \]

Developing a classification model to predict diabetic patient readmission status will contribute to improving healthcare quality, improving patient health and long-term health outcomes, avoiding unnecessary readmission costs, supporting clinical decision-making, and improving hospital’s operational efficiency (V R Reji Raj and Mr. Rasheed Ahammed Azad. V 2023).

2. Data Description

The dataset used in this assignment titled “Diabetes 130-US Hospitals for Years 1999-2008” was obtained from the Health Facts database, a national warehouse that collects comprehensive clinical records from hospitals across the US (Strack et al. 2014). The raw dataset contains 101,767 records and 50 attributes, collected from 130 hospitals between 1999 and 2008. The dataset includes categorical features representing patient and hospital detailes, such as race, gender, medical specialty, diagnoses, 23 medication-related features (e.g., metformin, insulin, etc.), and numerical attributes such as patient number, admission type and number of lab procedures.

Data Dictionary
Code
#create table for dataset features name and type
datatype_tbl <- tibble(
  Variable = names(diabetic_data),
  Type = sapply(diabetic_data, class)
)

#split into variable-type pairs
pairs <- datatype_tbl %>%
  mutate(Pair = map2(Variable, Type, ~c(.x, .y))) %>%
  pull(Pair) %>%
  unlist()

#set a table of 6 columns for better display
num_cols <- 6 
rows_needed <- ceiling(length(pairs) / num_cols)
length(pairs) <- rows_needed * num_cols 

#set column names and conver to matrix
col_names <- rep(c("Variable", "Type"), num_cols / 2)
pair_matrix <- matrix(pairs, ncol = num_cols, byrow = TRUE)

#display table using kable
kable(pair_matrix, col.names = col_names, escape = FALSE) %>%
  kable_styling(full_width = FALSE)
Variable Type Variable Type Variable Type
encounter_id numeric patient_nbr numeric race character
gender character age character weight character
admission_type_id numeric discharge_disposition_id numeric admission_source_id numeric
time_in_hospital numeric payer_code character medical_specialty character
num_lab_procedures numeric num_procedures numeric num_medications numeric
number_outpatient numeric number_emergency numeric number_inpatient numeric
diag_1 character diag_2 character diag_3 character
number_diagnoses numeric max_glu_serum character A1Cresult character
metformin character repaglinide character nateglinide character
chlorpropamide character glimepiride character acetohexamide character
glipizide character glyburide character tolbutamide character
pioglitazone character rosiglitazone character acarbose character
miglitol character troglitazone character tolazamide character
examide character citoglipton character insulin character
glyburide-metformin character glipizide-metformin character glimepiride-pioglitazone character
metformin-rosiglitazone character metformin-pioglitazone character change character
diabetesMed character readmitted character NA NA

The Diabetes 130-US Hospitals for Years 1999-2008 dataset is very challenging, as it meets all the Dataset Selection Guidelines stated in the project requirements.

  • Large Size: The size of the dataset is large, as it contains over 100,000 records.
  • Messy: The dataset is considered messy, with more than 192,000 missing values identified across multiple features.
  • Complex: The dataset is considerably complex, as it contains a mix of categorical and numerical attributes representing complex relationships between multiple health factors.
  • Integration: The data was integrated from different data sources that included 130 hospitals.
  • Multi-class: The dataset involves a multi-class classification task based on the readmitted attribute (<30 if the patient was readmitted within 30 days, >30 if readmitted after 30 days, and No if the patient was not readmitted).

These aspects highlight the dataset’s complexity and suitability for evaluating real-world clinical scenarios using computational statistical models.

3. Clean and Prepare

Load, Understand, and Categorize Data
Code
Code
#get dimensions of the dataset
dim(diabetic_data)
[1] 101766     50
Code
#display structure of the dataset
str(diabetic_data)
spc_tbl_ [101,766 × 50] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ encounter_id            : num [1:101766] 2278392 149190 64410 500364 16680 ...
 $ patient_nbr             : num [1:101766] 8222157 55629189 86047875 82442376 42519267 ...
 $ race                    : chr [1:101766] "Caucasian" "Caucasian" "AfricanAmerican" "Caucasian" ...
 $ gender                  : chr [1:101766] "Female" "Female" "Female" "Male" ...
 $ age                     : chr [1:101766] "[0-10)" "[10-20)" "[20-30)" "[30-40)" ...
 $ weight                  : chr [1:101766] "?" "?" "?" "?" ...
 $ admission_type_id       : num [1:101766] 6 1 1 1 1 2 3 1 2 3 ...
 $ discharge_disposition_id: num [1:101766] 25 1 1 1 1 1 1 1 1 3 ...
 $ admission_source_id     : num [1:101766] 1 7 7 7 7 2 2 7 4 4 ...
 $ time_in_hospital        : num [1:101766] 1 3 2 2 1 3 4 5 13 12 ...
 $ payer_code              : chr [1:101766] "?" "?" "?" "?" ...
 $ medical_specialty       : chr [1:101766] "Pediatrics-Endocrinology" "?" "?" "?" ...
 $ num_lab_procedures      : num [1:101766] 41 59 11 44 51 31 70 73 68 33 ...
 $ num_procedures          : num [1:101766] 0 0 5 1 0 6 1 0 2 3 ...
 $ num_medications         : num [1:101766] 1 18 13 16 8 16 21 12 28 18 ...
 $ number_outpatient       : num [1:101766] 0 0 2 0 0 0 0 0 0 0 ...
 $ number_emergency        : num [1:101766] 0 0 0 0 0 0 0 0 0 0 ...
 $ number_inpatient        : num [1:101766] 0 0 1 0 0 0 0 0 0 0 ...
 $ diag_1                  : chr [1:101766] "250.83" "276" "648" "8" ...
 $ diag_2                  : chr [1:101766] "?" "250.01" "250" "250.43" ...
 $ diag_3                  : chr [1:101766] "?" "255" "V27" "403" ...
 $ number_diagnoses        : num [1:101766] 1 9 6 7 5 9 7 8 8 8 ...
 $ max_glu_serum           : chr [1:101766] "None" "None" "None" "None" ...
 $ A1Cresult               : chr [1:101766] "None" "None" "None" "None" ...
 $ metformin               : chr [1:101766] "No" "No" "No" "No" ...
 $ repaglinide             : chr [1:101766] "No" "No" "No" "No" ...
 $ nateglinide             : chr [1:101766] "No" "No" "No" "No" ...
 $ chlorpropamide          : chr [1:101766] "No" "No" "No" "No" ...
 $ glimepiride             : chr [1:101766] "No" "No" "No" "No" ...
 $ acetohexamide           : chr [1:101766] "No" "No" "No" "No" ...
 $ glipizide               : chr [1:101766] "No" "No" "Steady" "No" ...
 $ glyburide               : chr [1:101766] "No" "No" "No" "No" ...
 $ tolbutamide             : chr [1:101766] "No" "No" "No" "No" ...
 $ pioglitazone            : chr [1:101766] "No" "No" "No" "No" ...
 $ rosiglitazone           : chr [1:101766] "No" "No" "No" "No" ...
 $ acarbose                : chr [1:101766] "No" "No" "No" "No" ...
 $ miglitol                : chr [1:101766] "No" "No" "No" "No" ...
 $ troglitazone            : chr [1:101766] "No" "No" "No" "No" ...
 $ tolazamide              : chr [1:101766] "No" "No" "No" "No" ...
 $ examide                 : chr [1:101766] "No" "No" "No" "No" ...
 $ citoglipton             : chr [1:101766] "No" "No" "No" "No" ...
 $ insulin                 : chr [1:101766] "No" "Up" "No" "Up" ...
 $ glyburide-metformin     : chr [1:101766] "No" "No" "No" "No" ...
 $ glipizide-metformin     : chr [1:101766] "No" "No" "No" "No" ...
 $ glimepiride-pioglitazone: chr [1:101766] "No" "No" "No" "No" ...
 $ metformin-rosiglitazone : chr [1:101766] "No" "No" "No" "No" ...
 $ metformin-pioglitazone  : chr [1:101766] "No" "No" "No" "No" ...
 $ change                  : chr [1:101766] "No" "Ch" "No" "Ch" ...
 $ diabetesMed             : chr [1:101766] "No" "Yes" "Yes" "Yes" ...
 $ readmitted              : chr [1:101766] "NO" ">30" "NO" "NO" ...
 - attr(*, "spec")=
  .. cols(
  ..   encounter_id = col_double(),
  ..   patient_nbr = col_double(),
  ..   race = col_character(),
  ..   gender = col_character(),
  ..   age = col_character(),
  ..   weight = col_character(),
  ..   admission_type_id = col_double(),
  ..   discharge_disposition_id = col_double(),
  ..   admission_source_id = col_double(),
  ..   time_in_hospital = col_double(),
  ..   payer_code = col_character(),
  ..   medical_specialty = col_character(),
  ..   num_lab_procedures = col_double(),
  ..   num_procedures = col_double(),
  ..   num_medications = col_double(),
  ..   number_outpatient = col_double(),
  ..   number_emergency = col_double(),
  ..   number_inpatient = col_double(),
  ..   diag_1 = col_character(),
  ..   diag_2 = col_character(),
  ..   diag_3 = col_character(),
  ..   number_diagnoses = col_double(),
  ..   max_glu_serum = col_character(),
  ..   A1Cresult = col_character(),
  ..   metformin = col_character(),
  ..   repaglinide = col_character(),
  ..   nateglinide = col_character(),
  ..   chlorpropamide = col_character(),
  ..   glimepiride = col_character(),
  ..   acetohexamide = col_character(),
  ..   glipizide = col_character(),
  ..   glyburide = col_character(),
  ..   tolbutamide = col_character(),
  ..   pioglitazone = col_character(),
  ..   rosiglitazone = col_character(),
  ..   acarbose = col_character(),
  ..   miglitol = col_character(),
  ..   troglitazone = col_character(),
  ..   tolazamide = col_character(),
  ..   examide = col_character(),
  ..   citoglipton = col_character(),
  ..   insulin = col_character(),
  ..   `glyburide-metformin` = col_character(),
  ..   `glipizide-metformin` = col_character(),
  ..   `glimepiride-pioglitazone` = col_character(),
  ..   `metformin-rosiglitazone` = col_character(),
  ..   `metformin-pioglitazone` = col_character(),
  ..   change = col_character(),
  ..   diabetesMed = col_character(),
  ..   readmitted = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
Code
#dummary statistics for each column
summary(diabetic_data)
  encounter_id        patient_nbr            race              gender         
 Min.   :    12522   Min.   :      135   Length:101766      Length:101766     
 1st Qu.: 84961194   1st Qu.: 23413221   Class :character   Class :character  
 Median :152388987   Median : 45505143   Mode  :character   Mode  :character  
 Mean   :165201646   Mean   : 54330401                                        
 3rd Qu.:230270888   3rd Qu.: 87545950                                        
 Max.   :443867222   Max.   :189502619                                        
     age               weight          admission_type_id
 Length:101766      Length:101766      Min.   :1.000    
 Class :character   Class :character   1st Qu.:1.000    
 Mode  :character   Mode  :character   Median :1.000    
                                       Mean   :2.024    
                                       3rd Qu.:3.000    
                                       Max.   :8.000    
 discharge_disposition_id admission_source_id time_in_hospital
 Min.   : 1.000           Min.   : 1.000      Min.   : 1.000  
 1st Qu.: 1.000           1st Qu.: 1.000      1st Qu.: 2.000  
 Median : 1.000           Median : 7.000      Median : 4.000  
 Mean   : 3.716           Mean   : 5.754      Mean   : 4.396  
 3rd Qu.: 4.000           3rd Qu.: 7.000      3rd Qu.: 6.000  
 Max.   :28.000           Max.   :25.000      Max.   :14.000  
  payer_code        medical_specialty  num_lab_procedures num_procedures
 Length:101766      Length:101766      Min.   :  1.0      Min.   :0.00  
 Class :character   Class :character   1st Qu.: 31.0      1st Qu.:0.00  
 Mode  :character   Mode  :character   Median : 44.0      Median :1.00  
                                       Mean   : 43.1      Mean   :1.34  
                                       3rd Qu.: 57.0      3rd Qu.:2.00  
                                       Max.   :132.0      Max.   :6.00  
 num_medications number_outpatient number_emergency  number_inpatient 
 Min.   : 1.00   Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000  
 1st Qu.:10.00   1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000  
 Median :15.00   Median : 0.0000   Median : 0.0000   Median : 0.0000  
 Mean   :16.02   Mean   : 0.3694   Mean   : 0.1978   Mean   : 0.6356  
 3rd Qu.:20.00   3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 1.0000  
 Max.   :81.00   Max.   :42.0000   Max.   :76.0000   Max.   :21.0000  
    diag_1             diag_2             diag_3          number_diagnoses
 Length:101766      Length:101766      Length:101766      Min.   : 1.000  
 Class :character   Class :character   Class :character   1st Qu.: 6.000  
 Mode  :character   Mode  :character   Mode  :character   Median : 8.000  
                                                          Mean   : 7.423  
                                                          3rd Qu.: 9.000  
                                                          Max.   :16.000  
 max_glu_serum       A1Cresult          metformin         repaglinide       
 Length:101766      Length:101766      Length:101766      Length:101766     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
 nateglinide        chlorpropamide     glimepiride        acetohexamide     
 Length:101766      Length:101766      Length:101766      Length:101766     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
  glipizide          glyburide         tolbutamide        pioglitazone      
 Length:101766      Length:101766      Length:101766      Length:101766     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
 rosiglitazone        acarbose           miglitol         troglitazone      
 Length:101766      Length:101766      Length:101766      Length:101766     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
  tolazamide          examide          citoglipton          insulin         
 Length:101766      Length:101766      Length:101766      Length:101766     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
 glyburide-metformin glipizide-metformin glimepiride-pioglitazone
 Length:101766       Length:101766       Length:101766           
 Class :character    Class :character    Class :character        
 Mode  :character    Mode  :character    Mode  :character        
                                                                 
                                                                 
                                                                 
 metformin-rosiglitazone metformin-pioglitazone    change         
 Length:101766           Length:101766          Length:101766     
 Class :character        Class :character       Class :character  
 Mode  :character        Mode  :character       Mode  :character  
                                                                  
                                                                  
                                                                  
 diabetesMed         readmitted       
 Length:101766      Length:101766     
 Class :character   Class :character  
 Mode  :character   Mode  :character  
                                      
                                      
                                      

The dataset named “diabetic_data.csv” was loaded using read_csv() with show_col_types = FALSE to suppress the warning message that appears when column data types are automatically inferred. str() function was used to examine the structure of the dataset, whereas summary() function was used to provide key descriptive statistics such as minimum, median, mean, and maximum values. From the functions output it was observed that most of the diabetes dataset features fall into two main data types:

  • Numeric: These include hospitalization duration, lab test results, medications, and patient counts (e.g., time_in_hospital, num_procedures, num_medications).
  • Character: These include categorical attributes such as race, gender, age, diagnosis codes (diag_1, diag_2, diag_3), and medication-related features like insulin.
Identify and Handle Missing Values

Once a general understanding was gained on the dataset, the dataset was examined for missing and null values. The dataset was first scanned for standard NA valus using is.na() function. This check returned zero missing values, indicating that missing data might have been recorded in alternative forms. Upon further inspection of the dataset, it was found that missing values were represented in various non-standard forms such as “?” and “Unknown/Invalid”. These are often used to indicate missing or uncertain information. To systematically identify such entries, a loop was implemented to scan each column and count the occurrences of these values. After conducting a second-level inspection for missing values, eight features were identified with non-standard missing entries. These were primarily represented by question marks “?” and a few “Unknown/Invalid” values. The figure below displays the missing values distribution, all these missing values were replaced with NA for uniformity.

Code
Code
#check for actual NA values in each column
na_counts <- colSums(is.na(diabetic_data))

#convert to a data frame
na_summary <- enframe(na_counts, name = "Column", value = "NA_Count")

#display table using kable
kable(na_summary, caption = "Standard Missing Values (NA) per Column") %>%
  kable_styling(full_width = FALSE, position = "center")
Standard Missing Values (NA) per Column
Column NA_Count
encounter_id 0
patient_nbr 0
race 0
gender 0
age 0
weight 0
admission_type_id 0
discharge_disposition_id 0
admission_source_id 0
time_in_hospital 0
payer_code 0
medical_specialty 0
num_lab_procedures 0
num_procedures 0
num_medications 0
number_outpatient 0
number_emergency 0
number_inpatient 0
diag_1 0
diag_2 0
diag_3 0
number_diagnoses 0
max_glu_serum 0
A1Cresult 0
metformin 0
repaglinide 0
nateglinide 0
chlorpropamide 0
glimepiride 0
acetohexamide 0
glipizide 0
glyburide 0
tolbutamide 0
pioglitazone 0
rosiglitazone 0
acarbose 0
miglitol 0
troglitazone 0
tolazamide 0
examide 0
citoglipton 0
insulin 0
glyburide-metformin 0
glipizide-metformin 0
glimepiride-pioglitazone 0
metformin-rosiglitazone 0
metformin-pioglitazone 0
change 0
diabetesMed 0
readmitted 0
Code
#create empty vectors to store results
column_names <- c()
question_marks <- c()
empty_strings <- c()
unknowns <- c()
unknown_invalids <- c()

#loop through each column
for (i in 1:ncol(diabetic_data)) {
  col_name <- colnames(diabetic_data)[i]
  col_data <- diabetic_data[[i]] 

  #count each missing type
  qmark <- sum(col_data == "?", na.rm = TRUE)
  empty <- sum(col_data == "", na.rm = TRUE)
  unk <- sum(col_data == "Unknown", na.rm = TRUE)
  unk_inv <- sum(col_data == "Unknown/Invalid", na.rm = TRUE)

  #only record if at least one missing-like value exists
  if (qmark + empty + unk + unk_inv > 0) {
    column_names <- c(column_names, col_name)
    question_marks <- c(question_marks, qmark)
    empty_strings <- c(empty_strings, empty)
    unknowns <- c(unknowns, unk)
    unknown_invalids <- c(unknown_invalids, unk_inv)
  }
}

#combine all into one data frame (after the loop)
missing_summary <- data.frame(
  Column = column_names,
  Question_Mark = question_marks,
  Empty_String = empty_strings,
  Unknown = unknowns,
  Unknown_Invalid = unknown_invalids
)
Code
#display table using kable
kable(missing_summary, align = "lcccc", caption = "Summary of Missing Values") %>%
  kable_styling(full_width = FALSE, position = "center")
Summary of Missing Values
Column Question_Mark Empty_String Unknown Unknown_Invalid
race 2273 0 0 0
gender 0 0 0 3
weight 98569 0 0 0
payer_code 40256 0 0 0
medical_specialty 49949 0 0 0
diag_1 21 0 0 0
diag_2 358 0 0 0
diag_3 1423 0 0 0
Code
#prepare data for ggplot
missing_long <- missing_summary %>%
  pivot_longer(
    cols = c(Question_Mark, Empty_String, Unknown, Unknown_Invalid),
    names_to = "Type",
    values_to = "Count"
  )

#plot missing value count
p <- ggplot(missing_long, aes(x = reorder(Column, -Count), y = Count, fill = Type)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Summary of Missing Values Counts",
    x = "Feature",
    y = "Missing Count",
    fill = "Missing Type"
  ) +
  scale_fill_viridis_d(option = "D", begin = 0.1, end = 0.9) + #color-blind friendly
  theme_minimal(base_size = 10) +
  theme(
    axis.text.x = element_text(angle = 40, hjust = 1),
    plot.title = element_text(face = "bold", size = 10, hjust = 0.5))

#make plot interactive
ggplotly(p)
Code
Code
#replace "?" with NA
diabetic_data[diabetic_data == "?"] <- NA

#replace "Unknown/Invalid" with NA
diabetic_data[diabetic_data == "Unknown/Invalid"] <- NA

The following features were removed for the dataset as each of these features has more than 40% missing values. Additionally, these features are not not essential for building a reliable predictive model nor impact the classification task to predict diabetic patient readmission status within 30 days of discharge.

  • weight represents the patient’s body weight. More than 97% of weight values are missing. Due to its lack of completeness, it is not practical to include this feature.
  • payer_code refers to the patient’s method of payment or insurance with approximately 40% missing data. This feature is more administrative than predictive and has minimal relevance to the classification goal.
  • medical_specialty indicates the specialty of the admitting physician, with nearly 50% of missing values. Additionally, this feature contains a large number of inconsistent and sparse categories, which will not be beneficial during model building and training.

Since the dataset is already large and only a reletvily small number of observations had missing values in important features like race, gender, and diagnosis codes, it was decided to remove those observations reduceding the dataset from 101,766 to 98,052 observations, keeping only complete records for analysis. Additionally, the features encounter_id (a unique ID for each hospital visit) and patient_nbr (a unique ID for each patient) were removed since these two features do not provide any useful information for predicting diabetic patient readmission status.

Code
Code
#remove the weight, payer_code, medical_specialty  column
diabetic_data <- diabetic_data %>% select(-c(weight, payer_code, medical_specialty))

dim(diabetic_data)
[1] 101766     47
Code
#remove rows that contain any NA values
diabetic_data <- na.omit(diabetic_data)
dim(diabetic_data)
[1] 98052    47
Code
#remove the encounter ID and patient number columns
diabetic_data <- diabetic_data %>% select(-c(encounter_id, patient_nbr ))
dim(diabetic_data)
[1] 98052    45
Features Engineering for Categorical Features

The number of unique values in each column are counted to identify features with high cardinality. Most medication-related features had only 2–4 unique values, making them straightforward to include in the model. However, the diagnosis features (diag_1, diag_2, and diag_3) contained hundreds of unique ICD-9 codes, the official system of assigning codes to diagnoses associated with hospital in the US (Strack et al. 2014). Thus, using the diagnosis codes directly would be complex and may lead to overfitting. As these codes are highly uninterpretable, they were mapped into broader disease groups, following guidelines from the reference documentation (Strack et al. 2014). This transformation preserves the medical meaning while reducing the number of categories. For example, codes between 390–459 and 785 were mapped to “Circulatory”, while codes from 460–519 and 786 were grouped under “Respiratory”. Some codes such as “V45” did not fall into any of the defined ICD-9 groupings, thus were assigned the group “Unknown” to preserve them for further review.

Code
Code
for (col_name in colnames(diabetic_data)) {
  print(col_name)
  print(length(unique(diabetic_data[[col_name]])))
}
[1] "race"
[1] 5
[1] "gender"
[1] 2
[1] "age"
[1] 10
[1] "admission_type_id"
[1] 8
[1] "discharge_disposition_id"
[1] 26
[1] "admission_source_id"
[1] 17
[1] "time_in_hospital"
[1] 14
[1] "num_lab_procedures"
[1] 118
[1] "num_procedures"
[1] 7
[1] "num_medications"
[1] 75
[1] "number_outpatient"
[1] 39
[1] "number_emergency"
[1] 33
[1] "number_inpatient"
[1] 20
[1] "diag_1"
[1] 713
[1] "diag_2"
[1] 740
[1] "diag_3"
[1] 786
[1] "number_diagnoses"
[1] 14
[1] "max_glu_serum"
[1] 4
[1] "A1Cresult"
[1] 4
[1] "metformin"
[1] 4
[1] "repaglinide"
[1] 4
[1] "nateglinide"
[1] 4
[1] "chlorpropamide"
[1] 4
[1] "glimepiride"
[1] 4
[1] "acetohexamide"
[1] 2
[1] "glipizide"
[1] 4
[1] "glyburide"
[1] 4
[1] "tolbutamide"
[1] 2
[1] "pioglitazone"
[1] 4
[1] "rosiglitazone"
[1] 4
[1] "acarbose"
[1] 4
[1] "miglitol"
[1] 4
[1] "troglitazone"
[1] 2
[1] "tolazamide"
[1] 3
[1] "examide"
[1] 1
[1] "citoglipton"
[1] 1
[1] "insulin"
[1] 4
[1] "glyburide-metformin"
[1] 4
[1] "glipizide-metformin"
[1] 2
[1] "glimepiride-pioglitazone"
[1] 2
[1] "metformin-rosiglitazone"
[1] 1
[1] "metformin-pioglitazone"
[1] 2
[1] "change"
[1] 2
[1] "diabetesMed"
[1] 2
[1] "readmitted"
[1] 3
Code
# === diag_1 ===
diabetic_data$diag_1_prefix <- substr(diabetic_data$diag_1, 1, 3)
diabetic_data$diag_1_num <- as.numeric(diabetic_data$diag_1_prefix)
Warning: NAs introduced by coercion
Code
diabetic_data$diag_1_group <- NA
diabetic_data$diag_1_group[diabetic_data$diag_1_num >= 390 & diabetic_data$diag_1_num <= 459 | diabetic_data$diag_1_num == 785] <- "Circulatory"
diabetic_data$diag_1_group[diabetic_data$diag_1_num >= 460 & diabetic_data$diag_1_num <= 519 | diabetic_data$diag_1_num == 786] <- "Respiratory"
diabetic_data$diag_1_group[diabetic_data$diag_1_num >= 520 & diabetic_data$diag_1_num <= 579 | diabetic_data$diag_1_num == 787] <- "Digestive"
diabetic_data$diag_1_group[diabetic_data$diag_1_num >= 250 & diabetic_data$diag_1_num < 251] <- "Diabetes"
diabetic_data$diag_1_group[diabetic_data$diag_1_num >= 800 & diabetic_data$diag_1_num <= 999] <- "Injury"
diabetic_data$diag_1_group[diabetic_data$diag_1_num >= 710 & diabetic_data$diag_1_num <= 739] <- "Musculoskeletal"
diabetic_data$diag_1_group[diabetic_data$diag_1_num >= 580 & diabetic_data$diag_1_num <= 629 | diabetic_data$diag_1_num == 788] <- "Genitourinary"
diabetic_data$diag_1_group[diabetic_data$diag_1_num >= 140 & diabetic_data$diag_1_num <= 239] <- "Neoplasms"
diabetic_data$diag_1_group[diabetic_data$diag_1_num >= 1 & diabetic_data$diag_1_num <= 139] <- "Infectious"
diabetic_data$diag_1_group[diabetic_data$diag_1_num >= 290 & diabetic_data$diag_1_num <= 319] <- "Mental Disorders"
diabetic_data$diag_1_group[diabetic_data$diag_1_num %in% c(780, 781, 784) | (diabetic_data$diag_1_num >= 790 & diabetic_data$diag_1_num <= 799)] <- "Other"
diabetic_data$diag_1_group[is.na(diabetic_data$diag_1_group)] <- "Unknown"  #mark any unmatched code as unknown

# === diag_2 ===
diabetic_data$diag_2_prefix <- substr(diabetic_data$diag_2, 1, 3)
diabetic_data$diag_2_num <- as.numeric(diabetic_data$diag_2_prefix)
Warning: NAs introduced by coercion
Code
diabetic_data$diag_2_group <- NA
diabetic_data$diag_2_group[diabetic_data$diag_2_num >= 390 & diabetic_data$diag_2_num <= 459 | diabetic_data$diag_2_num == 785] <- "Circulatory"
diabetic_data$diag_2_group[diabetic_data$diag_2_num >= 460 & diabetic_data$diag_2_num <= 519 | diabetic_data$diag_2_num == 786] <- "Respiratory"
diabetic_data$diag_2_group[diabetic_data$diag_2_num >= 520 & diabetic_data$diag_2_num <= 579 | diabetic_data$diag_2_num == 787] <- "Digestive"
diabetic_data$diag_2_group[diabetic_data$diag_2_num >= 250 & diabetic_data$diag_2_num < 251] <- "Diabetes"
diabetic_data$diag_2_group[diabetic_data$diag_2_num >= 800 & diabetic_data$diag_2_num <= 999] <- "Injury"
diabetic_data$diag_2_group[diabetic_data$diag_2_num >= 710 & diabetic_data$diag_2_num <= 739] <- "Musculoskeletal"
diabetic_data$diag_2_group[diabetic_data$diag_2_num >= 580 & diabetic_data$diag_2_num <= 629 | diabetic_data$diag_2_num == 788] <- "Genitourinary"
diabetic_data$diag_2_group[diabetic_data$diag_2_num >= 140 & diabetic_data$diag_2_num <= 239] <- "Neoplasms"
diabetic_data$diag_2_group[diabetic_data$diag_2_num >= 1 & diabetic_data$diag_2_num <= 139] <- "Infectious"
diabetic_data$diag_2_group[diabetic_data$diag_2_num >= 290 & diabetic_data$diag_2_num <= 319] <- "Mental Disorders"
diabetic_data$diag_2_group[diabetic_data$diag_2_num %in% c(780, 781, 784) | (diabetic_data$diag_2_num >= 790 & diabetic_data$diag_2_num <= 799)] <- "Other"
diabetic_data$diag_2_group[is.na(diabetic_data$diag_2_group)] <- "Unknown"

# === diag_3 ===
diabetic_data$diag_3_prefix <- substr(diabetic_data$diag_3, 1, 3)
diabetic_data$diag_3_num <- as.numeric(diabetic_data$diag_3_prefix)
Warning: NAs introduced by coercion
Code
diabetic_data$diag_3_group <- NA
diabetic_data$diag_3_group[diabetic_data$diag_3_num >= 390 & diabetic_data$diag_3_num <= 459 | diabetic_data$diag_3_num == 785] <- "Circulatory"
diabetic_data$diag_3_group[diabetic_data$diag_3_num >= 460 & diabetic_data$diag_3_num <= 519 | diabetic_data$diag_3_num == 786] <- "Respiratory"
diabetic_data$diag_3_group[diabetic_data$diag_3_num >= 520 & diabetic_data$diag_3_num <= 579 | diabetic_data$diag_3_num == 787] <- "Digestive"
diabetic_data$diag_3_group[diabetic_data$diag_3_num >= 250 & diabetic_data$diag_3_num < 251] <- "Diabetes"
diabetic_data$diag_3_group[diabetic_data$diag_3_num >= 800 & diabetic_data$diag_3_num <= 999] <- "Injury"
diabetic_data$diag_3_group[diabetic_data$diag_3_num >= 710 & diabetic_data$diag_3_num <= 739] <- "Musculoskeletal"
diabetic_data$diag_3_group[diabetic_data$diag_3_num >= 580 & diabetic_data$diag_3_num <= 629 | diabetic_data$diag_3_num == 788] <- "Genitourinary"
diabetic_data$diag_3_group[diabetic_data$diag_3_num >= 140 & diabetic_data$diag_3_num <= 239] <- "Neoplasms"
diabetic_data$diag_3_group[diabetic_data$diag_3_num >= 1 & diabetic_data$diag_3_num <= 139] <- "Infectious"
diabetic_data$diag_3_group[diabetic_data$diag_3_num >= 290 & diabetic_data$diag_3_num <= 319] <- "Mental Disorders"
diabetic_data$diag_3_group[diabetic_data$diag_3_num %in% c(780, 781, 784) | (diabetic_data$diag_3_num >= 790 & diabetic_data$diag_3_num <= 799)] <- "Other"
diabetic_data$diag_3_group[is.na(diabetic_data$diag_3_group)] <- "Unknown"

dim(diabetic_data)
[1] 98052    54

The original diagnosis featuers diag_1, diag_2, and diag_3 were removed after new diagnosis group feature was engineered. This transformation helps reduce high cardinality and potential noise, making the data more suitable for building models. The intermediate columns used during the transformation (such as prefixes and numeric conversions) were also dropped to keep the dataset clean and focused on meaningful features. The diagnosis values that could not be mapped to ICD-9 categories and were labeled as “Unknown”, lacked clinical meaning and were thus removed from the analysis. Removing these observations helps ensure cleaner inputs for modeling and reduces the risk of introducing noise or ambiguity during model training. As a result, the dataset was reduced from 98,052 to 56,673 observations.

Code
Code
#drop original diagnosis columns: diag_1, diag_2, diag_3
diabetic_data <- diabetic_data %>% select(-c(diag_1, diag_2, diag_3))


#drop helper columns used for processing
diabetic_data <- diabetic_data %>% select(-c(diag_1_prefix, diag_2_prefix, diag_3_prefix))
diabetic_data <- diabetic_data %>% select(-c(diag_1_num, diag_2_num, diag_3_num))

dim(diabetic_data)
[1] 98052    45
Code
diabetic_data$diag_1_group[diabetic_data$diag_1_group == "Unknown"] <- NA
diabetic_data$diag_2_group[diabetic_data$diag_2_group == "Unknown"] <- NA
diabetic_data$diag_3_group[diabetic_data$diag_3_group == "Unknown"] <- NA

diabetic_data <- na.omit(diabetic_data)

#check new dimensions
dim(diabetic_data)
[1] 56673    45

Based on the summary of unique values and a review of the dataset, two features were further dropped: discharge_disposition_id (26 unique values) and admission_source_id (15 unique values). Although these columns are stored as numeric values, they actually represent nominal categories (e.g., source of admission, discharge status) . Keeping these features would require encoding them into many sub-features using one-hot encoding. This could increase dimensionality and make the model more complex, especially since we don’t have labels explaining what each ID value represents. More importantly, using them directly in models (especially linear models or distance-based models like KNN) can confuse the model, as it might interpret the numerical IDs incorrectly. For example, it may think that ID 4 is twice as important as ID 2, which is not true. Furthermore, all character-type features were converted to factors to ensure categorical variables are properly recognized and interpreted by statistical models.

Code
Code
diabetic_data <- diabetic_data %>% select(-c(discharge_disposition_id, admission_source_id))
dim(diabetic_data)
[1] 56673    43
Code
#convert all character columns to factors
for (col in names(diabetic_data)) {
  if (class(diabetic_data[[col]]) == "character") {
    diabetic_data[[col]] <- as.factor(diabetic_data[[col]])
  }
}

#show which columns are now factors
str(diabetic_data)
tibble [56,673 × 43] (S3: tbl_df/tbl/data.frame)
 $ race                    : Factor w/ 5 levels "AfricanAmerican",..: 3 3 3 3 3 3 1 3 3 1 ...
 $ gender                  : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 1 2 1 ...
 $ age                     : Factor w/ 10 levels "[0-10)","[10-20)",..: 4 5 6 8 9 10 5 5 9 7 ...
 $ admission_type_id       : num [1:56673] 1 1 2 1 2 3 1 1 1 3 ...
 $ time_in_hospital        : num [1:56673] 2 1 3 5 13 12 9 7 10 1 ...
 $ num_lab_procedures      : num [1:56673] 44 51 31 73 68 33 47 60 55 49 ...
 $ num_procedures          : num [1:56673] 1 0 6 0 2 3 2 0 1 5 ...
 $ num_medications         : num [1:56673] 16 8 16 12 28 18 17 15 31 2 ...
 $ number_outpatient       : num [1:56673] 0 0 0 0 0 0 0 0 0 0 ...
 $ number_emergency        : num [1:56673] 0 0 0 0 0 0 0 1 0 0 ...
 $ number_inpatient        : num [1:56673] 0 0 0 0 0 0 0 0 0 0 ...
 $ number_diagnoses        : num [1:56673] 7 5 9 8 8 8 9 8 8 8 ...
 $ max_glu_serum           : Factor w/ 4 levels ">200",">300",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ A1Cresult               : Factor w/ 4 levels ">7",">8","None",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ metformin               : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 3 2 2 ...
 $ repaglinide             : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 4 2 2 ...
 $ nateglinide             : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ chlorpropamide          : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ glimepiride             : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ acetohexamide           : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
 $ glipizide               : Factor w/ 4 levels "Down","No","Steady",..: 2 3 2 2 3 2 2 2 2 2 ...
 $ glyburide               : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 3 2 2 2 2 2 2 ...
 $ tolbutamide             : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
 $ pioglitazone            : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ rosiglitazone           : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 3 2 2 2 2 ...
 $ acarbose                : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ miglitol                : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ troglitazone            : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
 $ tolazamide              : Factor w/ 3 levels "No","Steady",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ examide                 : Factor w/ 1 level "No": 1 1 1 1 1 1 1 1 1 1 ...
 $ citoglipton             : Factor w/ 1 level "No": 1 1 1 1 1 1 1 1 1 1 ...
 $ insulin                 : Factor w/ 4 levels "Down","No","Steady",..: 4 3 3 2 3 3 3 1 3 3 ...
 $ glyburide-metformin     : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ glipizide-metformin     : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
 $ glimepiride-pioglitazone: Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
 $ metformin-rosiglitazone : Factor w/ 1 level "No": 1 1 1 1 1 1 1 1 1 1 ...
 $ metformin-pioglitazone  : Factor w/ 1 level "No": 1 1 1 1 1 1 1 1 1 1 ...
 $ change                  : Factor w/ 2 levels "Ch","No": 1 1 2 2 1 1 2 1 2 2 ...
 $ diabetesMed             : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
 $ readmitted              : Factor w/ 3 levels "<30",">30","NO": 3 3 2 2 3 3 2 1 3 2 ...
 $ diag_1_group            : Factor w/ 11 levels "Circulatory",..: 5 9 1 1 1 1 2 1 1 11 ...
 $ diag_2_group            : Factor w/ 11 levels "Circulatory",..: 2 9 1 11 1 9 1 2 1 6 ...
 $ diag_3_group            : Factor w/ 11 levels "Circulatory",..: 1 2 2 2 5 11 6 2 1 4 ...
 - attr(*, "na.action")= 'omit' Named int [1:41379] 1 2 6 11 17 20 25 28 34 39 ...
  ..- attr(*, "names")= chr [1:41379] "1" "2" "6" "11" ...
Identify and Handle Outliers

To further clean and prepare the dataset for modeling, dataset outliers shall be identified and appropriately handled to reduce outliers impact on model training, model performance, and improve the robustness of predictive results. The dataset numerical features were examined for potential outliers using the Interquartile Range \(IQR\) method. This method uses IQR to count the number of outliers in each numeric feature and stores the results in a data-frame for further analysis and visualization. The numerical features in the dataset were first identified using is.numeric() function. Each numeric feature is processed through the \(IQR\) outlier detection method: calculating the first quartile \(Q1\), third quartile \(Q3\), the lower and upper bounds (defined as \(Q1 -1.5 × IQR\) and \(Q3 + 1.5 × IQR\), respectively).

The boxplots below provides a comparative view of the numerical features with outliers that were identified by the IQR method. Features such as num_lab_procedures, num_medications, and various visits (number_emergency, number_inpatient, number_outpatient) exhibit a clear right-skewedness distribution with a large number of high-end outliers. These features may need special treatment before normalization or modeling to avoid bias in the results.

Code
Code
#identify numeric columns
numeric_cols <- diabetic_data %>% select(where(is.numeric))

#loop over numeric columns and calculate outliers using IQR
outlier_summary <- numeric_cols %>%
  map_df(~{
    Q1 <- quantile(.x, 0.25)
    Q3 <- quantile(.x, 0.75)
    IQR <- Q3 - Q1
    lower_bound <- Q1 - 1.5 * IQR
    upper_bound <- Q3 + 1.5 * IQR
    outlier_count <- sum(.x < lower_bound | .x > upper_bound)
    tibble(Outlier_Count = outlier_count)
  }, .id = "Feature") %>%
  arrange(desc(Outlier_Count))
Code
#display table using kable
kable(outlier_summary, caption = "Numerical Features Outliers (Using IQR)") %>%
  kable_styling(full_width = FALSE, position = "center")
Numerical Features Outliers (Using IQR)
Feature Outlier_Count
number_outpatient 9093
number_emergency 6090
number_inpatient 3908
num_medications 1631
time_in_hospital 1242
admission_type_id 204
num_lab_procedures 133
number_diagnoses 25
num_procedures 0
Code
#reshape to long format for ggplot
diabetic_data_long <- diabetic_data[, outlier_summary$Feature] %>%
  mutate(row_id = row_number()) %>%
  pivot_longer(cols = -row_id, names_to = "Feature", values_to = "Value")

#plot boxplot using ggplot
p <- ggplot(diabetic_data_long, aes(x = Feature, y = Value, color = Feature)) +
  geom_boxplot() +
  theme_minimal(base_size = 10) +
  theme(
    axis.text.y = element_text(size = 10),
    axis.text.x = element_text(angle = 40),
    plot.title = element_text(face = "bold", size = 10, hjust = 0.3),
    legend.position = "none"
  ) +
  scale_y_continuous(expand = expansion(mult = c(0.05, 0.1))) +
  labs(
    title = "Numeric Features with Outliers",
    x = "Feature",
    y = "Value"
  )

#make plot interactive
ggplotly(p)

To handle outliers, we iterated over all numeric variables and marked observations containing an outlier for easier identification. Since normalization (like Min-Max scaling) is highly sensitive to extreme values the presence of outliers in the dataset will have an extreme impact. If we normalize before removing outliers, the min and max values will be skewed, and most of the data will be squished into a narrow range near 0, making it harder for the model to learn from the majority of the data. Thus, observations with outliers values are removed resulting in a dataset with 39,304 observations and 43 features.

Code
Code
#create a logical index of rows to keep (initialize with TRUEs)
keep_rows <- rep(TRUE, nrow(diabetic_data))

#loop over numeric columns and mark rows with outliers
for (col in names(numeric_cols)){
  values <- diabetic_data[[col]]
  Q1 <- quantile(values, 0.25)
  Q3 <- quantile(values, 0.75)
  IQR <- Q3 - Q1
  lower_bound <- Q1 - 1.5 * IQR
  upper_bound <- Q3 + 1.5 * IQR

  #identify outliers for this feature
  outlier_mask <- values < lower_bound | values > upper_bound
  
  #mark FALSE for rows that are outliers in this column
  keep_rows <- keep_rows & !outlier_mask
}

#apply the filter once
diabetic_data <- diabetic_data[keep_rows, ]

#check new dimensions
dim(diabetic_data)
[1] 39304    43
Normalization

Min-Max normalization is applied to selected numeric features as they have different scales and large ranges, which could unfairly influence models that are sensitive to feature magnitude, such as KNN or logistic regression. Based on the output of the summary() function, it was observed that the following six features had relatively high maximum values or wide ranges: num_lab_procedures, num_medications, number_outpatient, number_emergency, number_inpatient, and number_diagnoses. These features were normalized using Min-Max scaling, which transforms their values to fall between 0 and 1. This ensures that no single feature dominates the learning process due to having larger numeric values. Features that had small ranges or represented categorical values (such as admission_type_id) were left unchanged, as normalization is not meaningful or necessary in those cases.

Code
Code
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

#apply normalization to selected features
diabetic_data$num_lab_procedures   <- normalize(diabetic_data$num_lab_procedures)
diabetic_data$num_medications      <- normalize(diabetic_data$num_medications)
diabetic_data$number_outpatient    <- normalize(diabetic_data$number_outpatient)
diabetic_data$number_emergency     <- normalize(diabetic_data$number_emergency)
diabetic_data$number_inpatient     <- normalize(diabetic_data$number_inpatient)
diabetic_data$number_diagnoses     <- normalize(diabetic_data$number_diagnoses)

4. Explore and Visualize

Exploratory Data Analysis (EDA) has been conducted on the cleaned dataset to gain a comprehensive insights into the dataset, uncover patterns, and identify modeling strategies. The analysis focuses the target variable readmitted and the dataset features that are potentially associated with readmission status, contributing to the classification goal to predict diabetic patient readmission status within 30 days of discharge.

Target Variable Distribution

The bar chart below visualizes the distribution of the target variable readmitted. It can be noted that more than 50% of patients were not readmitted, around 30% of patients were readmitted after 30 days, and only 10% were readmitted within 30 days. The bar chart reveals a significant class imbalance, where the high-risk class readmitted within 30 days (<30) being the minority class. This class imbalance highlights the need for class weights adjustment during modeling to improve classifier sensitivity.

Code
Code
#summary table with three columns and rename them as follow
readmit_dist <- diabetic_data %>%
  dplyr::count(readmitted, name = "Total_Patients") %>%
  dplyr::mutate(
    Proportion = Total_Patients / sum(Total_Patients),
    Percentage = percent(Proportion, accuracy = 1)
  ) %>%
  rename(`Readmission Category` = readmitted) %>%
  select(`Readmission Category`, Total_Patients, Percentage)
Code
#plot readmission class distribution
p <- ggplot(readmit_dist, aes(x = reorder(`Readmission Category`, -Total_Patients), y = Total_Patients, fill = `Readmission Category`)) +
  geom_col(width = 0.6, show.legend = FALSE) +
  geom_text(
    aes(label = paste0(Percentage, "\n(", comma(Total_Patients), ")")),
    vjust = -0.3,
    size = 2,
    fontface = "bold"
  ) +
  scale_fill_viridis_d(option = "D", begin = 0.1, end = 0.9) + #color-blind friendly
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.1))) +
  labs(
    title = "Distribution of Readmission Status",
    x = "Readmission Category",
    y = "Number of Patients"
  ) +
  theme_minimal(base_size = 10)+
   theme(
    plot.title = element_text(face = "bold", size = 10, hjust = 0.5))

#make plot interactive
ggplotly(p)
Patients Demographic Composition

Patients demographic features such as gender, age, and race were analyzed to uncover readmission patterns and understand where the bulk of readmissions is coming from. The figure below highlights that readmission status are relatively balanced between males and females, with both gender having similar readmission patterns, indicating no major impact. Looking at the age groups the figure highlights most patients are between 50 and 80 years old, where we see the highest readmission counts can be noted, indicating that age is a possible factor for impacting readmission status. The higher counts within caucasian patients reflects population volume rather than higher risk of readmission, as caucasian patients make up the largest racial group within the dataset; therefore, proportions analysis was also conducted to provide a more accurate comparison.

Code
#gender plot
p1 <- ggplot(diabetic_data, aes(x = gender, fill = readmitted)) +
  geom_bar(position = "dodge") +
  labs(
    x = "Gender",
    fill = "Readmitted"
  ) +
  scale_fill_viridis_d(option = "D", begin = 0.1, end = 0.9) + #color-blind friendly
  theme_minimal(base_size = 10) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.2),
    legend.position = "right"
  )

#age plot
p2 <- ggplot(diabetic_data, aes(x = age, fill = readmitted)) +
  geom_bar(position = "dodge") +
  labs(
    x = "Age",
    fill = "Readmitted"
  ) +
  scale_fill_viridis_d(option = "D", begin = 0.1, end = 0.9) + #color-blind friendly
  theme_minimal(base_size = 10) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.2),
    axis.text.x = element_text(angle = 40, hjust = 0.2)
  )

#race plot
p3 <- ggplot(diabetic_data, aes(x = race, fill = readmitted)) +
  geom_bar(position = "dodge") +
  labs(
    x = "Race",
    fill = "Readmitted"
  ) +
  scale_fill_viridis_d(option = "D", begin = 0.1, end = 0.9) + #color-blind friendly
  theme_minimal(base_size = 10) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.2),
    axis.text.x = element_text(angle = 40, hjust = 0.2)
  )

#combine plots horizontally and make plot interactive
subplot_combined <- subplot(
  ggplotly(p1),
  ggplotly(p2) %>% layout(showlegend = FALSE),
  ggplotly(p3) %>% layout(showlegend = FALSE),
  nrows = 1,
  shareX = FALSE,
  shareY = FALSE,
  titleX = TRUE,
  titleY = FALSE
) %>%
  layout(
    title = list(
      text = "Patient Readmissions by Gender, Age, and Race",
      x = 0.5,
      xanchor = "center",
      font = list(size = 16, family = "Arial", color = "#333333")
    ),
    annotations = list(
      list(
        text = "Number of Patients",
        x = 0,
        xref = "paper",
        y = 0.5,
        yref = "paper",
        showarrow = FALSE,
        font = list(size = 12),
        textangle = -90
      )
    )
  )

#display the final result
subplot_combined
Patient Profile Comparison

The radar chart below provides a comparative overview of patient profiles across the three readmission classes. Patients who were readmitted within 30 days exhibit higher counts across most variables, including number of medications, length of hospital stay, emergency room visits, and inpatient encounters, suggesting a more complicated medical profile. In contrast, patients who were not readmitted generally demonstrate lower counts across most variables, with slight exception on the number of procedures, which may reflect more planned preventive care or early interventions rather than medical complications.

Code
#get the average profile for each readmission group
radar_data <- diabetic_data %>%
  group_by(readmitted) %>%
  summarise(
    Medications = mean(num_medications, na.rm = TRUE),
    Procedures = mean(num_procedures, na.rm = TRUE),
    TimeInHospital = mean(time_in_hospital, na.rm = TRUE),
    ERVisits = mean(number_emergency, na.rm = TRUE),
    InpatientVisits = mean(number_inpatient, na.rm = TRUE),
    OutpatientVisits = mean(number_outpatient, na.rm = TRUE),
    .groups = "drop"
  )

#fmsb needs the first two rows to define the range (max + min) of the axes
radar_chart <- rbind(
  apply(radar_data[,-1], 2, max),
  apply(radar_data[,-1], 2, min),
  radar_data[,-1]
)

#convert to numeric
radar_chart <- as.data.frame(lapply(radar_chart, as.numeric))
rownames(radar_chart) <- c("Max", "Min", radar_data$readmitted)

#set custom color-blind friendly colors
custom_colors <- c("#21908C", "#440154", "#5DC863")
colors_fill <- scales::alpha(custom_colors, 0.3)

#plot radar chart
radarchart(
  radar_chart,
  axistype = 1,
  pcol = custom_colors,
  pfcol = colors_fill,
  plwd = 2,
  plty = 1,
  cglcol = "grey",
  cglty = 1,
  axislabcol = "grey30",
  vlcex = 0.85,
  title = "Patient Profile Comparison by Readmission Status"
)

#add a legend to keep it readable
legend("topright", legend = radar_data$readmitted,
       bty = "n", pch = 20, col = custom_colors, text.col = "black", cex = 0.9)

Diagnosis-Based Readmission Patterns

The feature engineering carried out on the diagnosis codes feature, in the clean and prepare phase, facilitated an interpretable analysis on the impact of the diagnosis categories on the patient readmission status. The chart shows the distribution of diagnosis categories the three diagnosis levels (diag_1, diag_2, and diag_3), grouped by readmission status. Circulatory and Other conditions are the most frequent across all diagnosis levels, especially in the primary diagnosis (diag_1). In contrast, conditions like Diabetes and Neoplasms are more frequently recorded as secondary or tertiary issues, suggesting their significant impact on patient readmission status. Overall, this visualization provides insights on the underlying medical conditions that are possibly associated with hospital readmission, uncovering readmission pattrens.

Code
#combine diagnosis group variables for plotting
diag_long <- diabetic_data %>%
  select(readmitted, diag_1_group, diag_2_group, diag_3_group) %>%
  pivot_longer(cols = starts_with("diag_"), names_to = "Diagnosis_Level", values_to = "Diagnosis_Group")

#clean label names
diag_long$Diagnosis_Level <- recode(diag_long$Diagnosis_Level,
                                    diag_1_group = "Diagnosis 1",
                                    diag_2_group = "Diagnosis 2",
                                    diag_3_group = "Diagnosis 3")

#plot bar charts
p <- ggplot(diag_long, aes(x = fct_infreq(Diagnosis_Group), fill = readmitted)) +
  geom_bar(position = "dodge") +
  scale_fill_viridis_d(option = "D", begin = 0.1, end = 0.9) +  #color-blind friendly
  labs(
    title = "Readmission Count by Diagnosis Level (diag_1, diag_2, diag_3)",
    x = "Diagnosis Group",
    y = "Number of Patients",
    fill = "Readmitted"
  ) +
  facet_wrap(~ Diagnosis_Level, ncol = 1, scales = "free_x") +
  theme_minimal(base_size = 10) +
  theme(
    axis.text.x = element_text(angle = 35, hjust = 1, face = "bold"),
    strip.text = element_text(face = "bold"),
    legend.position = "right"
  )
ggplotly(p, height = 600)
Correlation Between Numeric Variables

The heatmap below presents the correlation matrix for the numeric features in the dataset, offering a snapshot of how these numeric features are correlated within the dataset. Overall, the correlations revealed that most numeric features are moderately correlated, indicating that each numeric features provide different types of information rather than association. key observations include:

  • A moderate positive correlation between time_in_hospital and both num_lab_procedures (0.33) and num_medications (0.43), indicating longer hospital stays are associated with more procedures and medication.
  • A low positive correlation between between most features, such as number_inpatient and num_procedures, indicating their independent value.

The correlation matrix supported the inclusion the dataset numeric features in the classifier modeling, as they appear to contribute unique information that supports the underlining classification goal to predict diabetic patient readmission status within 30 days of discharge.

Code
#identify numeric columns
numeric_vars <- diabetic_data[sapply(diabetic_data, is.numeric)]
numeric_vars <- numeric_vars[, colSums(!is.na(numeric_vars)) > 0]


#prepare correlation matrix
cor_matrix <- cor(numeric_vars, use = "complete.obs")
cor_df <- melt(cor_matrix)

#base heatmap with better visual harmony
p <- ggplot(cor_df, aes(x = Var2, y = Var1, fill = value)) +
  geom_tile(color = "white") +
  geom_text(aes(label = round(value, 2)), color = "black", size = 3.5) +
  scale_fill_gradient2(
    low = "#440154",      # red for negative
    mid = "white",        # neutral
    high = "#21908C",     # green for positive
    midpoint = 0,
    limits = c(-1, 1),
    name = "Correlation"
  ) +
  labs(
    title = "Correlation Between Patient Numeric Features",
    x = NULL, y = NULL
  ) +
  theme_minimal(base_size = 10) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, face = "bold"),
    axis.text.y = element_text(face = "bold"),
    legend.title = element_text(face = "bold"),
    plot.title = element_text(face = "bold", hjust = 0.5)
  )
ggplotly(p)
EDA Insights and Summary

The EDA provided valuable insights into the factors that most likely to influence hospital readmission among diabetic patients. Though some findings confirmed our initial expectations of the underlying dataset, others revealed more insightful patterns.

  • The target variable readmission revealed that majority of patient were not readmitted or were readmitted after 30 days. A smaller subset of patients, yet medically significant, were readmitted within 30 days, which confirmed a class imbalance that will be accounted for during modeling stage.
  • Demographics such as gender, race and age showcased some variation. Older age group more specifically (60-80) tend to dominate the readmission scene, which matches with chronical conditions such as diabetes.
  • The diagnostic groups helped tremendously with narrowing down most impactfull features. Circulatory and Respiratory diagnoses appear more frequently and subsequently have higher readmission. On the other hand, conditions such as Neoplasms (cancer) suprisngly showed low readmission, and that is likely due to follow-ups being handled by outpatient or via specialized clinic.
  • The correlation heatmaps confirmed a low correlation among numerical features, indicating low redundancy among features, which is ideal for building a classification model as each feature contributes different information.

These findings established a solid foundation to address the evaluation plan and model selection.

5. Evaluation Plan

As the exploratory data analysis of the target variable revealed a clear class imbalance, with the “NO” readmission class accounting for approximately 50% of the entire dataset, evaluation metrics that go beyond accuracy alone are crucial for properly evaluating model performance. To provide a more comprehensive assessment of model performance across all classes, below are the selected evaluation metrics:

  • Accuracy: This metric is used to measure the overall correctness of predictions across all samples, proportion of correct predictions. Accuracy is selected as a baseline comparison measure since it is a quick indicator of overall model performance, but can be misleading when classes are imbalanced.

  • Precision: This metric measures the proportion of predicted positives that were actually correct. Precision for each class (Precision_NO, Precision<30, Precision>30). Evaluates how accurate the model’s predictions are when it predicts a particular class. Precision is important metric when false positives are costly. Thus, this metric is selected because misclassifying a low-risk patient for a high-risk patient is costly for the classification task.

  • Recall (Sensitivity): This metric measures model’s ability to detect positive cases. Recall for each class (Recall_NO, Recall<30, Recall>30). Assesses the model’s ability to correctly identify actual instances within each class. This metric is chosen because it helps assess the model’s capability to distinguish between different readmission timeframes, which is crucial for clinical decision-making and targeted interventions.

  • F1 Score: This metric measures the harmonic mean of precision and recall, capturing the balance between the two. F1 Score for each class measures the harmonic mean of Precision and Recall, providing a balanced evaluation for each class. In imbalanced datasets, F1 is more informative than accuracy as it balances the trade-off between precision and recall, giving a more realistic measure of a model’s performance on the minority class.

  • Macro F1 Score: This metric measures the arithmetic mean of the F1 scores across all three classes, giving equal weight to each class and offering a fair assessment of overall performance. Macro F1 Score treats all classes equally, making it a proper measure of performance for imbalanced dataset, giving insight into whether the model is biased toward the majority class.

Given that this task represents a typical multi-class classification problem with imbalanced class distribution (e.g., NO cases are significantly more common than <30 cases), relying solely on Accuracy is insufficient for a comprehensive evaluation. Therefore, we emphasize per-class Precision, Recall, and F1 scores, which are particularly important for assessing how well the model performs on the high-risk class of early readmission (<30). Additionally, the use of Macro F1 Score helps mitigate the dominance of the majority class and provides a more representative evaluation of model performance.

6. Classification Models

Five classification models are selected for fitting and evaluating the dataset, each suited to the “Diabetes 130-US Hospitals for Years 1999-2008” dataset characteristics and classification goals. These classification models are:

  1. Decision Tree: Decision Tree are well-suited for our diabetes dataset with mixed-type variables and do not require standardization. The model offers high interpretability by allowing traceability of decision paths. It also supports class weighting, which can help improve detection of underrepresented classes such as <30. Evaluation will focus on class-wise recall and macro-averaged F1 score to assess balanced performance.

  2. Random Forest: Random Forest aggregates multiple decision trees to reduce overfitting and better capture complex feature interactions (num_inpatient, num_lab_procedures). It is robust to feature transformations, handles class imbalance effectively, and provides variable importance rankings. Model evaluation includes macro F1, and precision-recall to capture both overall and minority class performance.

  3. Support Vector Machine (SVM): SVM is suitable for our high-dimensional dataset and is capable of learning non-linear decision boundaries via kernel functions. With class weighting enabled(class_weight='balanced'), SVM can better manage imbalance, particularly for the <30 class. Performance will be evaluated using precision, recall, F1, macro F1, AUC, with emphasis on the minority class.

  4. k-Nearest Neighbors (k-NN): k-NN is a distance-based, non-parametric method that relies on similarity across feature space. As it is sensitive to feature scaling, standardization was applied. While k-NN serves as a strong baseline, it may favor majority classes. Therefore, it is used primarily for benchmarking, with evaluation based on per-class recall, per-class precision ,per-class F1, macro F1, Weighted F1 Score and MCC.

  5. Logistic Regression: Logistic regression is appropriate for modeling multi-class classification tasks with interpretable, probability-based outputs. It also supports class weighting, which is helpful for addressing class imbalance. Evaluation metrics includes accuracy, sensitivity, specificity, F1 score, macro F1 and weighted F1 score in both binary and multi-class settings.

7. Innovation

This project team made a deliberate effort to go beyond the project’s requirements and demonstrated innovation across the project plan and EDA. These innovations include: the selection of a challenging and novel dataset that meets all 5 dataset selection guidelines, inclusive and accessible visuals (all plots used color-blind–friendly colors and plots were made interactive with ggplotly to support dynamic exploration), domain-related feature engineering (transformation of the diagnosis features using ICD-9 codes map), and the use of kableExtra library intuitively displaying table results.

8. References

Sharma, Abhishek, Prateek Agrawal, Vishu Madaan, and Shubham Goyal. 2019. “Prediction on Diabetes Patient’s Hospital Readmission Rates.” Proceedings of the Third International Conference on Advanced Informatics for Computing Research, June, 1–5. https://doi.org/10.1145/3339311.3339349.
Strack, Beata, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura, Krzysztof J. Cios, and John N. Clore. 2014. “Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records.” BioMed Research International 2014: 1–11. https://doi.org/10.1155/2014/781670.
V R Reji Raj, and Mr. Rasheed Ahammed Azad. V. 2023. “Predictive Analysis of Heterogenous Data for Hospital Readmission.” International Journal of Scientific Research in Science, Engineering and Technology, January, 106–12. https://doi.org/10.32628/ijsrset231012.